Exploratory Data Analysis

Written for Stat 140XP

February 26 2024

Introduction

  • Exploratory data analysis involves the use of graphics, summaries and transformations to better understand the data and answer our questions.
  • Exploratory methods, make few assumptions about the distributional shape of the data.
  • Our results can help us refine our questions or develop new ones.

Here’s what Wickham says in “R for Data Science”: (page 82)

Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

There are no “real” rules here, but the data I’m using today comes from a competition, so obviously winning was (the competition is closed) the goal. But to win, it was important to focus on two things

  1. Variation
  2. Covariation

Loading Libraries

library(tidyverse)
library(DT)
library(lubridate)
library(ggthemes)
library(RColorBrewer)

Case Study

library(readr)
train <- read_csv("train.csv.gz")

A .gz or a .zip is a compressed file. Compression saves space on our machines. The functions in readr can read those directly (some functions require decompression first)

  • The dataset train contains the following variables, take a look:
datatable(train)

more compactly

glimpse(train)
Rows: 26,729
Columns: 10
$ AnimalID       <chr> "A671945", "A656520", "A686464", "A683430", "A667013", …
$ Name           <chr> "Hambone", "Emily", "Pearce", NA, NA, "Elsa", "Jimmy", …
$ DateTime       <dttm> 2014-02-12 18:22:00, 2013-10-13 12:44:00, 2015-01-31 1…
$ OutcomeType    <chr> "Return_to_owner", "Euthanasia", "Adoption", "Transfer"…
$ OutcomeSubtype <chr> NA, "Suffering", "Foster", "Partner", "Partner", "Partn…
$ AnimalType     <chr> "Dog", "Cat", "Dog", "Cat", "Dog", "Dog", "Cat", "Cat",…
$ SexuponOutcome <chr> "Neutered Male", "Spayed Female", "Neutered Male", "Int…
$ AgeuponOutcome <chr> "1 year", "1 year", "2 years", "3 weeks", "2 years", "1…
$ Breed          <chr> "Shetland Sheepdog Mix", "Domestic Shorthair Mix", "Pit…
$ Color          <chr> "Brown/White", "Cream Tabby", "Blue/White", "Blue Cream…

Initial Steps

  • The data needs to be prepared or processed for analysis (data cleaning)
  • An important issue will always be missing values.
  • na.omit( ) or complete.cases() can be used. (In more advanced courses, substitution of missing values instead of deletion)

missing values

train %>% summarise_all(funs(sum(is.na(.))))
# A tibble: 1 × 10
  AnimalID  Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome
     <int> <int>    <int>       <int>          <int>      <int>          <int>
1        0  7691        0           0          13612          0              1
# ℹ 3 more variables: AgeuponOutcome <int>, Breed <int>, Color <int>

substitutions

train <- train %>%
     mutate(NoName = is.na(Name))
table(train$NoName)

FALSE  TRUE 
19038  7691 

Make some decisions about the remainder. Real life data is always messy so make sure you use the useNA = option when tabling data:

table(train$OutcomeSubtype, useNA = "always")

         Aggressive              At Vet                Barn            Behavior 
                320                   4                   2                  86 
Court/Investigation             Enroute              Foster           In Foster 
                  6                   8                1800                  52 
          In Kennel          In Surgery             Medical             Offsite 
                114                   3                  66                 165 
            Partner         Rabies Risk                SCRP           Suffering 
               7816                  74                1599                1002 
               <NA> 
              13612 
table(train$SexuponOutcome, useNA = "always")

Intact Female   Intact Male Neutered Male Spayed Female       Unknown 
         3511          3525          9779          8820          1093 
         <NA> 
            1 
table(train$AgeuponOutcome, useNA = "always")

  0 years     1 day   1 month    1 week   1 weeks    1 year 10 months  10 years 
       22        66      1281       146       171      3969       457       446 
11 months  11 years  12 years  13 years  14 years  15 years  16 years  17 years 
      166       126       234       143        97        85        36        17 
 18 years  19 years    2 days  2 months   2 weeks   2 years  20 years    3 days 
       10         3        99      3397       529      3742         2       109 
 3 months   3 weeks   3 years    4 days  4 months   4 weeks   4 years    5 days 
     1277       659      1823        50       888       334      1071        24 
 5 months   5 weeks   5 years    6 days  6 months   6 years  7 months   7 years 
      652        11       992        50       588       670       288       531 
 8 months   8 years  9 months   9 years      <NA> 
      402       536       224       288        18 
train[, c(5,7,8)] <- train %>% select(OutcomeSubtype, SexuponOutcome, AgeuponOutcome) %>%
     mutate_each(funs(replace(., is.na(.), "Unknown")))
train %>% summarise_all(funs(sum(is.na(.))))
# A tibble: 1 × 11
  AnimalID  Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome
     <int> <int>    <int>       <int>          <int>      <int>          <int>
1        0  7691        0           0              0          0              0
# ℹ 4 more variables: AgeuponOutcome <int>, Breed <int>, Color <int>,
#   NoName <int>

Fixing that AgeuponOutcome variable

head(train$AgeuponOutcome)
[1] "1 year"  "1 year"  "2 years" "3 weeks" "2 years" "1 month"
train$AgeuponOutcome <- gsub('s', '', train$AgeuponOutcome)
table(train$AgeuponOutcome)

  0 year    1 day  1 month   1 week   1 year 10 month  10 year 11 month 
      22       66     1281      317     3969      457      446      166 
 11 year  12 year  13 year  14 year  15 year  16 year  17 year  18 year 
     126      234      143       97       85       36       17       10 
 19 year    2 day  2 month   2 week   2 year  20 year    3 day  3 month 
       3       99     3397      529     3742        2      109     1277 
  3 week   3 year    4 day  4 month   4 week   4 year    5 day  5 month 
     659     1823       50      888      334     1071       24      652 
  5 week   5 year    6 day  6 month   6 year  7 month   7 year  8 month 
      11      992       50      588      670      288      531      402 
  8 year  9 month   9 year  Unknown 
     536      224      288       18 

create a numeric age value:

train$Age <- sapply(train$AgeuponOutcome,  function(x) strsplit(x, split = ' ')[[1]][1]) %>% as.numeric()
head(train$Age)
[1] 1 1 2 3 2 1

correct the various units of time

The vectorized function ifelse() works nicely here

train$Unit <- sapply(train$AgeuponOutcome,  function(x) strsplit(x, split = ' ')[[1]][2])
train$AgeInDays <- ifelse(train$Unit == 'day', train$Age,
              ifelse(train$Unit == 'week', train$Age*7,
              ifelse(train$Unit == 'month', train$Age*30, 
              ifelse(train$Unit == 'year', train$Age*365, NA))))

Always check your work

datatable(train[, c(8, 12:14)])

Making more of that DateTime

train$hour    <- hour(train$DateTime)
train$DOW <- wday(train$DateTime)
train$month   <- month(train$DateTime)
train$year    <- year(train$DateTime)

Consider squeezing more information out of the data, Time of day can help us figure out when they are open:

plot(table(train$hour), type="h")

train$period   <- ifelse(train$hour > 8 & train$hour < 11, 'early',
                  ifelse(train$hour > 10 & train$hour < 17, 'midday',
                  ifelse(train$hour > 16 & train$hour < 19, 'endday', 'overnight')))

there is a problem with the ordering

ggplot(train, aes(x = period, y = AgeInDays)) + geom_boxplot() + theme_classic()

Use factor levels to order non-numeric data

train$period <- factor(train$period, 
                    levels = c('early', 'midday',
                               'endday', 'overnight'))

check the results

ggplot(train, aes(x = period, y = AgeInDays)) + 
    geom_boxplot(fill=c("red", "orange","yellow", "green")) + 
    theme_classic()

Always, Always Check The Data

You should probably run a quick summary or some other function which generates ranges and tabulates NAs to make certain that there are no unusual observations

summary(train)
   AnimalID             Name              DateTime                     
 Length:26729       Length:26729       Min.   :2013-10-01 09:31:00.00  
 Class :character   Class :character   1st Qu.:2014-05-31 16:31:00.00  
 Mode  :character   Mode  :character   Median :2014-12-13 17:10:00.00  
                                       Mean   :2014-12-19 00:22:23.96  
                                       3rd Qu.:2015-07-19 19:48:00.00  
                                       Max.   :2016-02-21 19:17:00.00  
                                                                       
 OutcomeType        OutcomeSubtype      AnimalType        SexuponOutcome    
 Length:26729       Length:26729       Length:26729       Length:26729      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 AgeuponOutcome        Breed              Color             NoName       
 Length:26729       Length:26729       Length:26729       Mode :logical  
 Class :character   Class :character   Class :character   FALSE:19038    
 Mode  :character   Mode  :character   Mode  :character   TRUE :7691     
                                                                         
                                                                         
                                                                         
                                                                         
      Age             Unit             AgeInDays           hour      
 Min.   : 0.000   Length:26729       Min.   :   0.0   Min.   : 0.00  
 1st Qu.: 2.000   Class :character   1st Qu.:  60.0   1st Qu.:12.00  
 Median : 2.000   Mode  :character   Median : 365.0   Median :15.00  
 Mean   : 3.628                      Mean   : 794.1   Mean   :14.45  
 3rd Qu.: 5.000                      3rd Qu.:1095.0   3rd Qu.:17.00  
 Max.   :20.000                      Max.   :7300.0   Max.   :23.00  
 NA's   :18                          NA's   :18                      
      DOW            month             year            period     
 Min.   :1.000   Min.   : 1.000   Min.   :2013   early    : 1683  
 1st Qu.:2.000   1st Qu.: 4.000   1st Qu.:2014   midday   :15195  
 Median :4.000   Median : 7.000   Median :2014   endday   : 7846  
 Mean   :3.976   Mean   : 6.926   Mean   :2014   overnight: 2005  
 3rd Qu.:6.000   3rd Qu.:10.000   3rd Qu.:2015                    
 Max.   :7.000   Max.   :12.000   Max.   :2016                    
                                                                  

consider building a dplyr solution

library(dplyr)
library(tidyr)
train %>% select(14:18) %>% 
            summarise_each(funs(min = min(., na.rm = TRUE), 
                      q25 = quantile(.,probs = 0.25, na.rm = TRUE), 
                      median = median(.,na.rm = TRUE), 
                      q75 = quantile(.,probs = 0.75, na.rm = TRUE), 
                      max = max(.,na.rm = TRUE),
                      mean = mean(.,na.rm = TRUE), 
                      sd = sd(.,na.rm = TRUE)))  %>%
gather(stat, val) %>%
separate(stat, into = c("var", "stat"), sep = "_") %>%
spread(stat, val) %>%
select(var, min, q25, median, q75, max, mean, sd) # reorder columns
# A tibble: 5 × 8
  var         min   q25 median   q75   max    mean       sd
  <chr>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl>
1 AgeInDays     0    60    365  1095  7300  794.   1083.   
2 DOW           1     2      4     6     7    3.98    2.07 
3 hour          0    12     15    17    23   14.4     3.34 
4 month         1     4      7    10    12    6.93    3.50 
5 year       2013  2014   2014  2015  2016 2014.      0.741

Variables of Substance

Knowing the number of unique values can help us determine whether a variable is suitable for analysis. Sometimes, a variable with many unique values is really just an identifier that is more useful for joining (merging) than for analysis. An example would be student id number, social security numbers, cell phone number etc.

You can use functions such as unique() to find out how many unique values a given variable has:

sapply(train, function(x) length(unique(x)))
      AnimalID           Name       DateTime    OutcomeType OutcomeSubtype 
         26729           6373          22918              5             17 
    AnimalType SexuponOutcome AgeuponOutcome          Breed          Color 
             2              5             44           1380            366 
        NoName            Age           Unit      AgeInDays           hour 
             2             22              5             44             20 
           DOW          month           year         period 
             7             12              4              4 
train %>% summarise_all(funs(n_distinct(.)))
# A tibble: 1 × 19
  AnimalID  Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome
     <int> <int>    <int>       <int>          <int>      <int>          <int>
1    26729  6373    22918           5             17          2              5
# ℹ 12 more variables: AgeuponOutcome <int>, Breed <int>, Color <int>,
#   NoName <int>, Age <int>, Unit <int>, AgeInDays <int>, hour <int>,
#   DOW <int>, month <int>, year <int>, period <int>

tidyr is useful for this kind of aggressive data restructuring

train %>%
  gather(var, value) %>% 
  distinct() %>% 
  count(var)
# A tibble: 19 × 2
   var                n
   <chr>          <int>
 1 Age               22
 2 AgeInDays         44
 3 AgeuponOutcome    44
 4 AnimalID       26729
 5 AnimalType         2
 6 Breed           1380
 7 Color            366
 8 DOW                7
 9 DateTime       22918
10 Name            6373
11 NoName             2
12 OutcomeSubtype    17
13 OutcomeType        5
14 SexuponOutcome     5
15 Unit               5
16 hour              20
17 month             12
18 period             4
19 year               4

Variables with few values can serve as grouping variables (good variables to use for subsetting).

String work

The variables that typically have too many values are strings, but we can be creative. There are 366 colors, too many to view on screen, but we can table and save to a data frame:

colors <- train %>% count(Color)
datatable(colors)
colors <- colors %>% mutate(color_rank = dense_rank(desc(n)))
datatable(colors)
train2 <- inner_join(train, colors[, c(1,3)])

breed

There is a lot of information here, but we can work with it using string functions.

train2$mix <- train2$Breed %>% str_detect(., "/|Mix") 
datatable(train2[, c("Breed","mix")])

You could probably think of more things we could do with our variables, like popular names, breed information, etc.

  • We begin with graphical tools to generate basic plots of the data to illuminate patterns
  • We then move into more complex plots to search for deviations from these patterns

Variation

  • Variability is interesting and typically one will examine variables with the most variation.
  • There is variability in balance
  • What is the source(s) of this variation?

One of the most important things to remember is the type of variable – do I have a numerical or categorical and if I have a numerical – is it discrete or continuous?

ggplot(train2, aes(x=AgeInDays)) + 
    geom_histogram(aes(y=..density..),bins=20, fill="cyan", color="black") +
    ggtitle("Age distribution of shelter animals at outcome") + 
    theme_wsj()

ggplot(train2, aes(x=AgeInDays)) + 
    geom_histogram(aes(y=..density..),bins=20, fill="cyan", color="black") +
    facet_wrap(~OutcomeType) +
    ggtitle("Age distribution of shelter animals at outcome") + 
    theme_tufte()

ggplot(train2, aes(x=AgeInDays,fill=factor(AnimalType))) + 
    geom_histogram(aes(y=..density..)) +
    facet_wrap(~OutcomeType + AnimalType) +
    ggtitle("Age distribution of shelter animals at outcome") + 
    theme_solarized() + theme(legend.position="none")

ggplot(train2, aes(x=AgeInDays,fill=factor(AnimalType))) + 
    geom_histogram(aes(y=..density..), color = "black") +
    facet_wrap(OutcomeType ~ AnimalType, nrow=5) +
    ggtitle("Age distribution of shelter animals at outcome") + 
    theme_pander() + theme(legend.position="none")

Outliers or Unusual Values

We might transform the variable “AgeInDays” to help us better understand it or perhaps make it more presentable.

Usually long right hand tails can use a square root or a log transformation to help normalize them or at least bring the relationship into clearer focus:

train2$AO <- paste(train2$AnimalType, train2$OutcomeType, sep="-")
ggplot(train2, aes(x=AO, y=sqrt(AgeInDays))) + 
    geom_boxplot(fill=rep(c("red","blue"), each=5)) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

So perhaps outliers are a combination of age, animal type and outcome. Old cats are not usually adopted (and old for a cat is about 2 years) and a dog is not “old” unil 6 or 7 years of age.

Working with groups

Tables are a start when working with categorical (non-numeric) variables

table(train2$OutcomeType, train2$SexuponOutcome, train2$AnimalType)
, ,  = Cat

                 
                  Intact Female Intact Male Neutered Male Spayed Female Unknown
  Adoption                  141          97          2001          2033       0
  Died                       40          64             9            11      23
  Euthanasia                206         231           108            75      90
  Return_to_owner            18          10           252           213       7
  Transfer                 1709        1525           695           680     896

, ,  = Dog

                 
                  Intact Female Intact Male Neutered Male Spayed Female Unknown
  Adoption                   62          61          3221          3153       0
  Died                       16          15            10             7       2
  Euthanasia                195         246           236           157      11
  Return_to_owner           283         467          1995          1535       6
  Transfer                  841         809          1252           956      59
prop.table(table(train2$OutcomeType, train2$SexuponOutcome, train2$AnimalType),c(1,3))
, ,  = Cat

                 
                  Intact Female Intact Male Neutered Male Spayed Female
  Adoption          0.033005618 0.022705993   0.468398876   0.475889513
  Died              0.272108844 0.435374150   0.061224490   0.074829932
  Euthanasia        0.290140845 0.325352113   0.152112676   0.105633803
  Return_to_owner   0.036000000 0.020000000   0.504000000   0.426000000
  Transfer          0.310445050 0.277020890   0.126248865   0.123524069
                 
                      Unknown
  Adoption        0.000000000
  Died            0.156462585
  Euthanasia      0.126760563
  Return_to_owner 0.014000000
  Transfer        0.162761126

, ,  = Dog

                 
                  Intact Female Intact Male Neutered Male Spayed Female
  Adoption          0.009542866 0.009388949   0.495767277   0.485300908
  Died              0.320000000 0.300000000   0.200000000   0.140000000
  Euthanasia        0.230769231 0.291124260   0.279289941   0.185798817
  Return_to_owner   0.066028931 0.108959403   0.465468969   0.358142790
  Transfer          0.214705131 0.206535614   0.319632372   0.244064335
                 
                      Unknown
  Adoption        0.000000000
  Died            0.040000000
  Euthanasia      0.013017751
  Return_to_owner 0.001399907
  Transfer        0.015062548
table(train2$OutcomeType, train2$SexuponOutcome, train2$AnimalType) %>% 
    prop.table(c(2,3)) %>%
    round(3)
, ,  = Cat

                 
                  Intact Female Intact Male Neutered Male Spayed Female Unknown
  Adoption                0.067       0.050         0.653         0.675   0.000
  Died                    0.019       0.033         0.003         0.004   0.023
  Euthanasia              0.097       0.120         0.035         0.025   0.089
  Return_to_owner         0.009       0.005         0.082         0.071   0.007
  Transfer                0.808       0.791         0.227         0.226   0.882

, ,  = Dog

                 
                  Intact Female Intact Male Neutered Male Spayed Female Unknown
  Adoption                0.044       0.038         0.480         0.543   0.000
  Died                    0.011       0.009         0.001         0.001   0.026
  Euthanasia              0.140       0.154         0.035         0.027   0.141
  Return_to_owner         0.203       0.292         0.297         0.264   0.077
  Transfer                0.602       0.506         0.186         0.165   0.756

barplot

ggplot(train2, aes(x = SexuponOutcome, y = ..count.., fill = OutcomeType)) +
  geom_bar(stat = 'count', position = 'fill', color = 'black') +
  facet_wrap(~AnimalType) +
  coord_flip() +
  labs(y = 'Proportion of Animals', 
       x = 'Sex',
       title = 'Outcomes by Day of Week') +
    theme_fivethirtyeight()

heat map

my_palette <- brewer.pal(n = 11, name = "Spectral")

 train2 %>%
      count(AO, period) %>%
      ggplot(mapping = aes(x = AO, y = period)) +
        geom_tile(mapping = aes(fill = n), alpha = 0.75) +
        scale_fill_gradientn(colors = my_palette)  +
        labs(x = "Animal Outcome", y = "Time Period", title = "Animal Outcomes by time period") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1))